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Abstract 

Statistical models with constrained probability distributions are abundant in machine 
learning. Some examples include regression models with norm constraints (e.g., Lasso), 
probit, many copula models, and latent Dirichlet allocation (LDA). Bayesian inference in¬ 
volving probability distributions confined to constrained domains could be quite challenging 
for commonly used sampling algorithms. In this paper, we propose a novel augmentation 
technique that handles a wide range of constraints by mapping the constrained domain to 
a sphere in the augmented space. By moving freely on the surface of this sphere, sampling 
algorithms handle constraints implicitly and generate proposals that remain within bound¬ 
aries when mapped back to the original space. Our proposed method, called Spherical 
Augmentation, provides a mathematically natural and computationally efficient frame¬ 
work for sampling from constrained probability distributions. We show the advantages of 
our method over state-of-the-art sampling algorithms, such as exact Hamiltonian Monte 
Carlo, using several examples including truncated Gaussian distributions, Bayesian Lasso, 
Bayesian bridge regression, reconstruction of quantized stationary Gaussian process, and 
LDA for topic modeling. 

Keywords: Gonstrained probability distribution; Geodesic; Hamiltonian; Monte Carlo; 
Lagrangian Monte Carlo 

1. Introduction 

Many commonly used statistical models in Bayesian analysis involve high-dimensional prob¬ 
ability distributions confined to constrained domains. Some examples include regression 
models with norm constraints (e.g., Lasso), probit, many copula models, and latent Dirich¬ 
let allocation (LDA). Very often, the resulting models are intractable and simulating sam- 


pies for Monte Carlo estimations is quite challenging (Neal and Roberts, 2008, S 

herlock 

and Roberts 

21 Ml!) 

Neal et al. 

2012 

Brubaker et al. 

2012; |Pakman and Paninski 

2013). 


Although the literature on improving the efficiency of computational methods for Bayesian 


inference is quite extensive (see, for example. 

Neal 

1996 

1993[ 

Geyer, 1992[ 

Mykland et al. 

1995; Propp and WilsonH1996 Roberts and Sahu 

1997 

Gilks et al., 1998 

Warnes, 2001 
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20031 Mpller et ah 
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2006( Kurihara et al. 

20061 Cappe et ah 

2008 

Grain et ah[ 
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2010 

Randal et al., 2007| Randal and P. 

20111 Welling 

and Teh, 20111 Zhang and Sutton 

2011 

Ahmadian et ah, 20L 

Girolami and Calderheadl 

201 1[ Hoffman and Gelman 

2011[|Beskos et ah, 2011[ Calderhead and Sustih 

., 2012 Shah- 

baba et al., 2014[ 

Ahn et al. 

2013[ Lan et al. 

Mlj 

\.hn et ah 

2014), these methods do not 


directly address the complications due to constrained target distributions. When dealing 
with such distributions, MCMC algorithms typically evaluate each proposal to ensure it is 
within the boundaries imposed by the constraints. Computationally, this is quite inefficient, 
especially in high dimensional problems where proposals are very likely to miss the con¬ 
strained domain. Alternatively, one could map the original domain to the entire Euclidean 
space to remove the boundaries. This approach too is computationally inefficient since the 
sampler needs to explore a much larger space than needed. 


In this paper, we propose a novel method, called Spherical Augmentation, for handling 
constraints involving norm inequalities (Figure [^. Our proposed method augments the pa¬ 
rameter space and maps the constrained domain to a sphere in the augmented space. The 
sampling algorithm explores the surface of this sphere. This way, it handles constraints 
implicitly and generates proposals that remain within boundaries when mapped back to 
the original space. While our method can be applied to all Metropolis-based sampling 


algorithms, we mainly focus on methods based on Hamiltonian Monte Carlo (HMC) (Du¬ 


ane et ah, 1987; Neal, 2011). As discussed by Neal (2011), one could modify standard 


HMC such that the sampler bounces off the boundaries by letting the potential energy go 
to infinity for parameter values that violate the constraints. This creates “energy walls” 
at boundaries. This approach, henceforth called Wall HMC, has limited applications and 
tends to be computationally inefficient, because the frequency of hitting and bouncing in¬ 
creases exponentially as dimension grows. Byrne and Girolami (2013) discuss an alternative 


approach for situations where constrained domains can be identihed as sub-manifolds. Pak- 


man and Paninski (2013) follow the idea of Wall HMC and propose an exact HMC algorithm 


specifically for truncated Caussian distributions with non-holonomic constraints. Brubaker 


et al. (2012) on the other hand propose a modified version of HMC for handling holonomic 
constraint c(0) = 0. All these methods provide interesting solutions for specific types of 
constraints. In contrast, our proposed method offers a general and efficient framework 
applicable to a wide range of problems. 


The paper is structured as follows. Before presenting our methods, in Section we 
provide a brief overview of HMC and one of its variants, namely, Lagrangian Monte Carlo 
(LMC) (Lan et ah, 2014). We then present the underlying idea of spherical augmentation, 
first for two simple cases, ball type (Section |3.1[ ) and box type (Section |3.2[ ) constraints, then 
for more general g-norm type constraints (Section |3.3[ ), as well as some functional constraints 
(Section |3.4[ ). In Sectio n [4| we apply the spherical augmentation technique to HMC (Section 
|4.2[ ) and LMC (Section 4.3) for sampling from constrained target distributions. We evaluate 
our proposed methods using simulated and real data in Section Finally, Section is 
devoted to discussion and future directions. 
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Figure 1: gr-norm constraints 


2. Preliminaries 

2.1 Hamiltonian Monte Carlo 

HMC improves upon random walk Metropolis (RWM) by proposing states that are distant 
from the current state, but nevertheless accepted with high probability. These distant 
proposals are found by numerically simulating Hamiltonian dynamics, whose state space 
consists of its position, denoted by the vector 6, and its momentum, denoted by the vector 
p. Our objective is to sample from the continuous probability distribution of 6 with the 
density function f{0). It is common to assume that the fictitious momentum variable 
p ~ AA(0, M), where M is a symmetric, positive-definite matrix known as the mass matrix, 
often set to the identity matrix I for convenience. 

In this Hamiltonian dynamics, the potential energy, U{6), is defined as minus the log 
density of 6 (plus any constant), that is U{6) := — log/(0); the kinetic energy, K{p) for the 
auxiliary momentum variable p is set to be minus the log density of p (plus any constant). 
Then the total energy of the system, Hamiltonian function, is defined as their sum, 

H{e,p) = u{e) + K{p) (1) 

Given the Hamiltonian H{d,p), the system of (0,p) evolves according to the following 
Hamilton’s equations, 

e = VpH(0,p) = M-ip 

p = -VeH{e,p) = -VeU{0) 

In practice when the analytical solution to Hamilton’s equations is not available, we 
need to numerically solve these equations by discretizing them, using some small time step 
e. For the sake of accuracy and stability, a numerical method called leapfrog is commonly 
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used to approximate the Hamilton’s equations (Neal, 2011). We usually solve the system 


for L steps, with some step size, e, to propose a new state in the Metropolis algorithm, and 
accept or reject it according to the Metropolis acceptance probability. (See Neal, 2011, for 
more discussions). 


2.2 Lagrangian Monte Carlo 

Although HMC explores the target distribution more efficiently than RWM, it does not fully 


exploit its geometric properties of the parameter space. Girolami and Calderhead (2011) 


propose Riemannian HMC (RHMC), which adapts to the local Riemannian geometry of the 
target distribution by using a position-specific mass matrix M = G{9). More specifically, 
they set G(0) to the Fisher information matrix. In this paper, we mainly use spherical 
metric instead to serve the purpose of constraint handling. The proposed method can be 
viewed as an extension to this approach since it explores the geometry of sphere. 

Following the argument of Amari and Nagaoka (2000), Girolami and Calderhead] ( 2011j ) 
dehne Hamiltonian dynamics on the Riemannian manifold endowed with metric G(0). With 
the non-flat metic, the momentum vector becomes p|0 ~ A/'(0, G{6)) and the Hamiltonian 
is therefore defined as follows: 


1 




(j){9) := U{9) + ^logdetG(6>) 


( 3 ) 


Unfortunately the resulting Riemannian manifold Hamiltonian dynamics becomes non- 
separable since it contains products of 9 and p, and the numerical integrator, generalized 
leapfrog, is an implicit scheme that involves time-consuming hxed-point iterations. 


Lan et al. (2014) propose to change the variables p i— v := G(0)-ip and define an 


explicit integrator for RHMC by using the following equivalent Lagrangian dynamics: 

0 = V (4) 

V =-v^r(0)v - G(0)^iV0(/>(0) (5) 

where the velocity v|0 ~ AA(0, G(0)“^). Here, r(0) is the Christoffel Symbols derived from 

G{9). 

The proposed explicit integrator is time reversible but not volume preserving. Based on 
the change of variables theorem, one can adjust the acceptance probability with Jacobian 
determinant to satisfy the detailed balance condition. The resulting algorithm, Lagrangian 


Monte Carlo (LMC), is shown to be more efficient than RHMC (See Lan et ah, 2014, for 
more details). 

Throughout this paper, we express the kinetic energy K in terms of velocity, v, instead 


of momentum, p (Beskos et ah, 2011 Lan et ah, 2014) 


3. Spherical Augmentation 

In this section, we introduce the spherical augmentation technique for handling norm con¬ 
straints implicitly. We start with two simple constraints: ball type (2-norm) and box type 
(oo-norm). Then, we generalize the methodology to arbitrary g-norm type constraints for 
q > 0. Finally, we discuss some functional constraints that can be reduced to norm con¬ 
straints. 
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Figure 2: Transforming the unit ball Bq{\) to the sphere . 


3.1 Ball type constraints 

Consider probability distributions confined to the Zl-dimensional unit ball ^^(1) := {0 £ 

■ ll^lb = -1 Gf < !}• The constraint is given by restricting the 2-norm of parame¬ 

ters: || 0||2 < 1. 

The idea of spherical augmentation is to augment the original H-dimensional manifold 
of unit ball Bq{1) to a hyper-sphere := {6 £ : ||0||2 = 1} in {D + l)-dimensional 

space. This can be done by adding an auxiliary variable 9d+i to the original parameter 
6 £ Bq{1) to form an extended parameter 6 = {9,9d^i) such that 6d+i = — H^Hl- 

Next, we identify the lower hemisphere with the upper hemisphere by ignoring the 
sign of Od+ 1 - This way, the domain of the target distribution is changed from the unit ball 
Bq{1) to the H-dimensional sphere, := {6 £ : ||0||2 = 1}, through the following 

transformation: 


Tb^s : B^{1) e^~e = (0, ±^l-||0||i) (6) 

which can also be recognized as the coordinate map from the Euclidean coordinate chart 
{6,Bq{1)} to the manifold . 

After collecting samples {0} using a sampling algorithm (e.g., HMC) defined on the 
sphere, , we discard the last component 0 _d+i and obtain the samples {0} that automat¬ 
ically satisfy the constraint ||0||2 < 1- Note that the sign of 9d+i does not affect our Monte 
Carlo estimates. However, after applying the above transformation, we need to adjust our 
estimates according to the change of variables theorem as follows: 



/(0)d0H 



dOs 

dOs. 


dOs. 


(7) 


where 


d0B 


dOs^ 


= |0_D-i-i| as shown in Corollary 


volume elements under the Euclidean metric anc 


in Appendix 


A.l 


Here, d0g and dOs^ are 


the canonical spherical metric respectively. 
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Figure 3: Transforming the hyper-rectangle TZq to the sphere . 

With the above transformation the resulting sampler is defined and moves freely 
on while implicitly handling the constraints imposed on the original parameters. As 
illustrated in Figure]^ the boundary of the constraint, i.e., ||0||2 = 1, corresponds to the 
equator on the sphere . Therefore, as the sampler moves on the sphere, e.g. from A 
to B, passing across the equator from one hemisphere to the other translates to “bouncing 
back” off the boundary in the original parameter space. 


3.2 Box type constraints 


Many constraints are given by both lower and upper bounds. Here we focus on a special 
case that defines a hyper-rectangle TZq := [0,7r]^“^ x [0, 27r); other box type constraints 
can be transformed to this hyper-rectangle. This constrained domain can be mapped to 
the unit ball Bq{1) and thus reduces to the ball type constraint discussed in Section 
However, a more natural approach is to use spherical coordinates, which directly map the 
hyper-rectangle TZq to the sphere S^, 


3.1 


TtIo^S ■ 7^0 -^ ‘ 5 '^, e Xd 


cos{6d) n^=i sin(6»j), d < D + 1 
n^iSinC^i), d = D + l 


( 8 ) 


Therefore, we use {6,7Zq} as the spherical coordinate chart for the manifold S^. Instead 

is treated as the 


of being appended with an extra dimension as in Section 


3.1 


here 6 G 


spherical coordinates of the point x G with ||x ||2 = 1. 

After obtaining samples {x} on the sphere S^, we transform them back to {0} in the 
original constrained domain TZq using the following inverse mapping of 


Ts^tIo ■ S 


D 


TZq , X I—)■ 0, 


= 


arccot 


Xd 


d<D 


arccot 


xd+1 


1 - Eti 

TT 

-F - sign(xD+i)(sign(x£)+i) - 1), d = D 

(9) 
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0 < q < oo 


q = cx) 



Figure 4: Transforming g-norm constrained domain to unit ball. Left: from unit cube 
to unit ball ^,^(1); Right from general g-norm domain to unit ball Bq{1). 


Similarly, we need to adjust the estimates based on the following change of variables formula: 




f{e)den, = / f{e) 

Jso 


dOjla 


dOf 


dOs^ 


( 10 ) 


where 


dOj, 


dOSr 


= sin {Oil) as shown Proposition 


A.3 


in Appendix 


A.3 


Here, 


dO^i^ and dOg^ are volume elements under the Euclidean metric and the round spherical 
metric respectively. 

With the above transformation Q, we can derive sampling methods on the sphere to 
implicitly handle box type constraints. As illustrated in Figure the red vertical boundary 


of TZq collapses to the north pole of , while the green vertical boundary collapses to the 
south pole. Two blue horizontal boundaries are mapped to the same prime meridian of 
shown in blue color. As the sampler moves freely on the sphere S^, the resulting samples 
automatically satisfy the original constraint after being transformed back to the original 
domain. 


3.3 General g-norm constraints 

The ball and box type constraints discussed in previous sections are in fact special cases 
of more general g-norm constraints with q set to 2 and oo respectively. In general, these 
constraints are expressed in terms of ( 7 -norm of the parameter vector (3 G 


II/3II, 


(Ef-ilftl’jA <ie(0,+oo) 

maxi<j<z) \/3i\, q = +cx) 


( 11 ) 
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This class of constraints is very common in statistics and machine learning. For example, 
when f3 are regression parameters, q = 2 corresponds to the ridge regression and q = 1 


corresponds to Lasso (Tibshirani, 1996). 


Denote the domain constrained by general g-norm as ;= {/3 G : ||/3||q < 1}. It 
could be quite challenging to sample probability distributions defined on (see Figure 
[^. To address this issue, we propose to transform to the unit ball Bq{1) so that the 
method discussed in Section ^ can be applied. As before, sampling methods defined on 
the sphere generate samples that automatically fall within .6^(1). Then we transform 
those samples back to the g-norm domain, Q^, and adjust the estimates with the following 
change of variables formula: 


Iqd 


f{(3)d(3Q = / fCe) 




d^Q 


dOs. 


des. 


( 12 ) 


where 


df3Q 


dldp 

des 



dOs, 


del 

dOs, 


del 


|0£)_|_i|. In the following, we introduce the bijective 


mappings between and Bq{1) and specify the associated Jacobian determinants 


<^Pg. 

del 


3.3.1 Norm constraints with q g (0,+oo) 

For q G (0, +oo), g-norm domain can be transformed to the unit ball Bq{1) bijectively 
via the following map (illustrated by the left panel of Figure]^: 

Tq^b- ^B^il), A^0, = sgn(A)|Ar/' (13) 


The Jacobian determinant of Tq^q is 
more details. 


d^ 

del 



2/g-l 

. See Appendix 


B 


for 


3.3.2 Norm constraints with q = +oo 

When q = +oo, the norm inequality defines a unit hypercube, := [—1,1]^ = {/3 G : 
||/3||oo < 1}) from which the more general form, hyper-rectangle, TZ^ := {/3 G : 1 < 
/3 < u}, can be obtained by proper shifting and scaling. The unit hypercube can be 
transformed to its inscribed unit ball ^^(1) through the following map (illustrated by the 
right panel of Figure]^: 

Tc^b : [-1,1]^ ^ B^{1), I3e^e = /3& (14) 


The Jacobian determinant of Tb^ti is 
in Appendix [B| 


del 


n£i details can be found 


3.4 Functional constraints 


Many statistical problems involve functional constraints. For example, [Pakman and Panin¬ 


ski (2013) discuss linear and quadratic constraints for multivariate Gaussian distributions. 


Since the target distribution is truncated Gaussian, Hamiltonian dynamics can be exactly 
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Figure 5: Sampling from a Gaussian distribution (first row) and a damped sine wave dis¬ 
tribution (second row) with linear constraints. First column shows the true dis¬ 
tributions. The exact HMC method of Pakman and Paninski (2013) is shown in 
the second column. The last two columns show our proposed methods. 


simulated and the boundary-hitting time can be analytically obtained. However, finding 
the hitting time and reflection trajectory is computationally expensive. Some constraints of 
this type can be handled by the spherical augmentation method more efficiently. Further, 
our method can be use for sampling from a wide range of distributions beyond Gaussian. 


3.4.1 Linear constraints 

In general, M linear constraints can be written as 1 < Af3 < u, where A is M x D matrix, 
/3 is a H-vector, and the boundaries 1 and u are both M-vectors. Here, we assume M = D 
and AdxD is invertible. (Note that we generally do not have A~^l < j3 < A~^u.) Instead 
of sampling /3 directly, we sample rj := A/3 with the box type constraint: 1 < ly < u. Now 
we can apply our proposed method to sample ry and transform it back to (3 = A~^ry. In 
this process, we use the following change of variables formula: 


/ f{(3)d(3 = / /(ry) 

'l<A/3<u 


df3 


dr] 


dr] 


(15) 


where 


dr) 

Figure 


= IA| 


I 5| illustrates that both exact HMC (Pakman and Paninski, 2013) and HMC with 

r_Q_5 i~ 

spherical augmentation can handle linear constraints, here 1 = 0, A = 


imposed on a 2d Gaussian distribution AA(/i, U) with ]i = 


row). However, the exact HMC is not applicable to more comp 


1 1 

1 0.5' 

0.5 1 

rcated distributions such as 


and S = 


and u = 2, 
(first 
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the damped sine wave distribution (second row in Figure]^ with the following density: 


/(/3) oc 


sin^ Q(/3) 

QiP) ' 


Q{(3) =-iP - E-\(3 - fi) 


(16) 


However, it is worth noting that for truncated Gaussian distributions, the exact HMC 
method of Pakman and Paninski (2013) can handle a wider range of linear constraints 
compared to our method. 


3.4.2 Quadratic constraints 

General quadratic constraints can be written as I < p'^Ap + P < u, where l,u > 0 
are scalars. We assume A^xD symmetric and positive definite. By spectrum theorem, we 
have the decomposition A = Qi'Q''', where Q is an orthogonal matrix and A' is a diagonal 
matrix of eigenvalues of A. By shifting and scaling, P ^ P* = {P + ^A~^b), we 

only need to consider the ring type constraints for /3*, 

@ : r <\\p*\\l = {P*YP* <u*, r =l + ^\P'A-^h, u* =u+^h^A~^h (17) 

which can be mapped to the unit ball as follows: 




H?(l), „£ii ^ (18) 


\\P* 


-Vi* 


We can then apply our proposed method in Section 3.1 to obtain samples {0} in Bq{1) and 
transform them back to the original domain with the following inverse operation of (18): 

: B^(l) ^ b^{V^)\bE{V¥), e^p*= ^ 


\e\ 


u* — Vl*)\\ 0\\2 + W) (19) 


In this process, we need the change of variables formula 


//</3TA/3+bT/3<i 


f{p)dp= / /(0) 


dp 


dOs. 


dOs. 


( 20 ) 


where 

1 )\/P. 


d/3 


dl3 

df3* 

dOs 

dOs, 



de'e 

dOs, 


A| •20^ ^{a - VI*)\0 d+i\, a = ViV + (1/||0||2 - 


3.4.3 More general constraints 

We close this section with some comments on more general types of constraints. In some 
problems, several parameters might be unconstrained, and the type of constraints might be 
vary across the constrained parameters. In such cases, we could group the parameters into 
blocks and update each block separately using the methods discussed in this section. When 
dealing with one-sided constraints, e.g. 9i > li, one can map the constrained domain to the 
whole space and sample the unconstrained parameter 6*, where 6i = \6*\+li. Alternatively, 
the one-sided constraint 9i > li can be changed to a two-sided constraint for 9* G (0,1) by 
setting 9i = -log0* -|- k. 
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4. Monte Carlos with Spherical Augmentation 

In this section, we show how the idea of spherical augmentation can be used to improve 
Markov Chain Monte Carlo methods applied to constrained probability distributions. In 
particular, we focus on two state-of-the-art sampling algorithms, namely Hamiltonian Monte 


Carlo(Duane et ah, 1987; Neal, 2011), and Lagrangian Monte Carlo(Lan et ah, 2014). Note 


however that our proposed method is generic so its application goes beyond these two 
algorithms. 


4.1 Common settings 

Throughout this section, we denote the original parameter vector as /3, the constrained 
domain as T>, the coordinate vector of sphere as 6. All the change of variables formulae 
presented in the previous section can be summarized as 

dfB-jy 


/ f{f3)df3v = / fiO) 
Iv Js 


dOs 


where 


dOs 


dOs (21) 

T> and dOs is some 


is the Jacobian determinant of the mapping T : S 
spherical measure. 

For energy based MCMC algorithms like HMC, RHMC and LMC, we need to investigate 
the change of energy under the above transformation. The original potential energy function 
U{(3) = — log/(/3) should be transformed to the following (t>{0) 


</,(0) = -log/(0) - log 




dOs 


= [/(/3(0))-log 


d(3j) 


dOs 


Consequently the total energy H (/3, v) in ([^ becomes 


1 


H{e,v) = ^{e) + -(v, v)g 5 ( 0 ) 


( 22 ) 


(23) 


The gradient of potential energy U, metric and natural gradient (preconditioned gradient) 
under the new coordinate system {0,5^} can be calculated as follows 


^eU{0) = 

Gsi0) = ^GM-'^^ 


Gsier^VgUie) = 


dO 

'dfd^ 

~de 


de' 


-1 


Gv{P)-^Vf3U{P) 


(24) 

(25) 

(26) 


4.2 Spherical Hamiltonian Monte Carlo 

We define HMC on the sphere in two different coordinate systems: the Cartesian co¬ 
ordinate and the spherical coordinate. The former is applied to ball type constraints or 
those that could be converted to ball type constraints; the later is more suited for box type 
constraints. Besides the merit of implicitly handling constraints, HMC on sphere can take 


advantage of the splitting technique (Beskos et ah, 2011 Shahbaba et ah, 2014 Byrne and 


Girolami, 2013) to further improve its computational efficiency. 
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4.2.1 Spherical HMC in the Cartesian coordinate 

We first consider HMC for the target distribution with density f{0) defined on the unit ball 
Bq{1) endowed with the Euclidean metric I. The potential energy is defined as U{6) := 
— logf{9). Associated with the auxiliary variable v (i.e., velocity), we define the kinetic 
energy K{'v) = 2 v'’'lv for v G TqBq{1), which is a H-dimensional vector sampled from the 
tangent space of Bq{1). Therefore, the Hamiltonian is defined on Bq{1) as 


H{9, v) = U{9) + A:(v) = U{9) + -v^Iv 


(27) 


Under the transformation in the above Hamiltonian (27) on Bq{1) will be 

changed to the follwing Hamiltonian H{9,'v) on as in (23): 


H{9,v) = m + -v^GsM^ = U{9) - log 


d9B 


d9s. 


+ (28) 


where the potential energy U{9) = U{9) (i.e., the distribution is fully defined in terms of the 
original parameter 9, which are the first D elements of 9), and G5^(0) = \d+99'^ /(1 — H^lH) 
is the canonical spherical metric. 

Viewing {9, Bq{1)} as the Euclidean coordinate chart of manifold {S^, Gs^{9)), we have 



the gradient of energy around the equator (0_d+i = 0), which in turn increases the numerical 
error in the discretized Hamiltonian dynamics. For the purpose of numerical stability, we in¬ 
stead consider the following partial Hamiltonian H*{9, v) and leave the volume adjustment 


as weights to adjust the estimation of integration (21): 

77*(0,v) = [/(0) +Vg 5 .( 0 )v 


(29) 


If we extend the velocity as v = (v,U£)+i) with vjo+i = — 0'''v/0£)+i, then v falls in the 
tangent space of the sphere, TgS^ := {v G 


0 V = 0}. Therefore, v'''G5^(0)v = v'''v. 
As a result, the partial Hamiltonian (29) can be recognized as the standard Hamiltonian 


(27) in the augmented {D -|- 1) dimensional space 


H*{9,ic) = Ui9)+K{ic) = U{9) + -v^v 


(30) 


This is due to the energy invariance presented as Proposition | A. 1 1 in Appendix Now 


we 


can sample the velocity v AA(O,G 5 ^( 0 ) ^) and set v = 


-9'^/Od+i 


V. Alternatively, 


since Cov[v] = 


-9'^ieD+i 


Gs^{9) ^ [l — 0/0D+i] = Id +1 — 99 is idempotent, we can 


sample v by (Id+i — 99 )z with z ~ ^"(0, Id+i)- 
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The Hamiltonian function (29) can be used to define the Hamiltonian dynamics on 


the Riemannian manifold {S^,Gs^{0)) in terms of (0,p), or equivalently as the following 
Lagrangian dynamics in terms of (0, v) (Lan et al., |2014): 


0 = V 


(31) 


V = -vTr5,(0)v-G5,(0)-^V0C/(0) 

where r 5 ^( 0 ) are the Christoffel symbols of second kind derived from 65 ^( 0 ). The Hamil¬ 


tonian (29) is preserved under Lagrangian dynamics (31). (See Lan et ah, 2014 for more 
discussion). 


[Byrne and Girolami (2013) split the Hamiltonian (29) as follows: 

= U{e)/2 + 1vTG5.(0)v + C/(0)/2 


(32) 


However, their approach requires the manifold to be embedded in the Euclidean space. To 
avoid this assumption, instead of splitting the Hamiltonian dynamics of (0, p), we split the 
corresponding Lagrangian dynamics (31) in terms of (0, v) as follows (See Appendix [C| for 
more details): 


0=0 

V = - 1g5.(0)-1V0C/(0) 


(33a) 


0 = V 

V = -v'^r5^(0)v 


(33b) 


Note that the hrst dynamics (33a) only involves updating velocity v in the tangent space 
TgS^ and has the following solution (see Appendix [c| for more details): 


e(t) = fl(o) 

v(*) = v(0)-i 


ID 


-0(O)0(O)^ V0[/(0(O)) 


(34) 


The second dynamics (33b) only involves the kinetic energy and has the geodesic flow that is 


a great circle (orthodrome or Riemannian circle) on the sphere as its analytical solution 


(See Appendix A .2 for more details): 


G{t) = 0 (O)cos(||v(O)|| 2 t)-k sin(||v( 0 )|| 2 t) 

l|v( 0)||2 

v(t) = - 0 (O)||N(O)|| 2 sin(||v(O)|| 2 t) + v(0)cos(||v 


(35) 


ii) 


This solution dehnes an evolution, denoted as gt : (0(0), v(0)) 1 —;■ (0(t), v(t)). Both (34) and 


(35) are symplectic. Due to the explicit formula for the geodesic flow on sphere, the second 


dynamics in (33b) is simulated exactly. Therefore, updating 0 does not involve discretization 


error so we can use large step sizes. This could lead to improved computational efficiency. 
Because this step is in fact a rotation on sphere, it can generate proposals that are far 
away from the current state. Algorithm shows the steps for implementing this approach, 
henceforth called Spherical HMC in the Cartesian coordinate (c-SphHMC). It can be shown 
that the integrator in the algorithm has order 3 local error and order 2 global error (See 
the details in Appendix [Pj) . 
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Algorithm 1 Spherical HMC in the Cartesian coordinate (c-SphHMC) 

Initialize 6^ ^ at current 6 after transformation 
Sample a new velocity value ~ AA(0 , Id+i) 


Set •(— 

Calculate + K{vW) 


for 7" = 1 

to L do 


= vW - . 


= 0(^)cos 

v(^+|) 


vC+b 

= vb+ 5 ). 

end for 



Id 
qt 

(^+ 2 ) lie) + 


^^“'"2) lie) 


v(^+3) 

Ilvi^+2i| 


■ sm 


;i(^+2)|| sin(||v*^^’''2)||e) + v(^+2) cos(||v^^'''2)||e) 


Id 

qt 


0(^+1) (0(£+l))Tj 

Calculate + K{v(^+^^) 


Calculate the acceptance probability a = min{l,exp[—+ 
Accept or reject the proposal according to a for the next state Q 
Calculate Ts^vi'S ) and the corresponding weight |dr 5 _^x>| 


4.2.2 Spherical HMC in the spherical coordinate 


Now we define HMC on the sphere in the spherical coordinate {0,7^^}. The natural 
metric on the sphere induced by the coordinate mapping Q is the round spherical 
methcj^ Gs,{0) = diag [l, sin^(6li), • • • , sin2(6lrf)]. 

we start with the usual Hamiltonian H{6,'v) defined on (7^^, I) as 


As in Section 


4.2.1 


in (27) with v G TqTZq . Under the transformation : 0 1 —)■ x in 

on TZq is changed to the following Hamiltonian 77(x, x) as in (23): 


Hamiltonian (27) 


if(x,i) = (/)(x) + ^v'^G 5 ^( 0 )v = C/(x) - log 


dOno 


dOs^ 


+ K-^GsAO)^ 


(36) 


where the potential energy f 7 (x) = U{9{x)) and 65^(0) is the round spherical metric. 


As before, the logorithm of volume adjustment is log 


dOlZn 


dOsj. 


= -\\og\Gs^ = -^thD- 


d)logsin(0rf) (See Appendix A.3). The last two terms in Equation (36) is the minus log 
density of v|0 ~ AA(0, G5^(0)~“^). Again, for numerical stability we consider the following 
partial Hamiltonian H*{x,x) and leave the volume adjustment as weights to adjust the 


estimation of integration (21): 


H*(x,i) = C/(0 ) +Vg 5 .( 0 )v 


(37) 


1 . Note, v'''Gs^( 0 )v < ||v||| < ||v||i = v'''Gs^( 0 )v. 
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Taking derivative of : 0 e->■ x in Q with respect to time t we have 

-Vdtan{6d)+ '^ViCot{6i)]xd, d<D + l 

i<d 

ViCot{9i)xD+i, d = D + l 


Xd= \ 


(38) 


We can show that x(0)'''x(0, v) = 0; that is, x G . Taking derivative of : x i—)■ 0 

in with respect to time t yields 


Vd ■=9d= < 


Xd 


^^d—1 

^ Lj=l 


XdXD+I—XDXD+1 


xjj + 


d<D 


d = D 


(39) 


D+l 


Further, we have v^G 5 ^( 0 )v = idic. Therefore, the partial Hamiltonian (37) can be 


recognized as the standard Hamiltonian (27) in the augmented {D + 1) dimensional space. 


which is again explained by the energy invariance Proposition A.l (See more details in 
Appendix 


H* (x, x) = [/(x) + AT (x) = [/(x) + 2 


(40) 


Similar to the method discussed in Section 


4.2.1 


we split the Hamiltonian (37), H*{9, v) = 


U{6)/2+^v'Gs^{9)v+U{0)/2, and its corresponding Lagrangian dynamics (31) as follows: 


0=0 

V = -^G5.(0)-'V0[/(0) 


(41a) 


9 = 


= -v'^r5^(0)v 


(41b) 


The first dynamics (41a) involves updating the velocity v only. However, the diagonal 
term of Gs^{9)~^, sin ‘^{Oi) increases exponentially fast as dimension grows. This will 


cause the velocity updated by (41a) to have extremely large components. To avoid such 
issue, we use small time vector e = [e, e^, • • • , e^], instead of scalar e, in updating Equation 


(41a 


The second dynamics (41b) describes the same geodesic flow on the sphere as 
(33b) but in the spherical coordinate {9,TIq}. Therefore it should have the same solution 
p5[ ) expressed in {0, TZ^}. To obtain this solution, we first apply T-jig^s : (0(0), v(0)) i— )• 
(x(0),x(0)), which consists of ([^(38). Then, we use gt in (35) to evolve (x(0),x(0)) for some 


time t to find (x(t),x(t)). Finally, we use Ts^tio ■ (^(0)^(^)) (^(^)w(^))) composite of 

), to go back to TZq . 

Algorithm [^summarizes the steps for this method, called Spherical HMC in the spherical 
coordinate (s-SphHMC). In theory, the hyper-rectangle TZ^ can be used as a base type 
(as the unit ball Bq{1) does) for general ( 7 -norm constraints for which s-SphHMC can be 
applied. This is because g-norm domain can be bijectively mapped to the hypercube 
C^, and thereafter to TZq . However the involved Jacobian matrix is rather complicated and 
s-SphHMC used in this way is not as efficient as c-SphHMC. Therefore, we use s-SphHMC 
only for box type constraints. 


15 

























Lan and Shahbaba 


Algorithm 2 Spherical HMC in the spherical coordinate (s-SphHMC) 


Initialize at current 0 after transformation Tx>^s 
Sample a new velocity value ~ AA(0, Id) 


Set V 


( 1 ) 


(1) rr'i-i m-irfl(i) 


n 


i=i sm 


ion,d = i, 


,D 


Calculate + K{vW) 

for = 1 to L do 

-^d )llj=isin {(’i ) 


d = 1, - ■ ■ ,D 




(i+h) 




d = 1, • • • ,D 

end for 

Calculate 

Calculate the acceptance probability a = min{l,exp[—+ v^^))]} 

Accept or reject the proposal according to a for the next state O' 

Calculate Ts^v{0') and the corresponding weight \dTs^v\ 


4.3 Spherical LMC on probability simplex 

A large class of statistical models involve dehning probability distributions on the simplex 


K 

:= {tt e M^l vTfc > 0, ^ TTrf = 1} 
k=l 


(42) 


As an example, we consider latent Dirichlet allocation (LDA) (Blei et ah, 2003), which 


is a hierarchical Bayesian model commonly used to model document topics. This type of 
constraints can be viewed as a special case of the 1-norm constraint, discussed in Section 
3.1 by identifying the first orthant (all positive components) with the others. Then, the 
c-SphHMC algorithmcan be applied to generate samples {0} on the sphere . These 
samples can be transformed as { 0 ^} and mapped back to the simplex A^. 


In what follows, we show that Fisher metric on the root space of simplex, := {0 G 

> 0, V/c = 1, • • • , K} (i.e. the hrst orthant of the sphere S^~ ■^), is the same as the 
canonical spherical metric 65 ^( 0 ) up to a constant. In this sense, it is more natural to dehne 

on the sphere S^~^. We start with the toy example discussed in 
). Denote the observed data as x = where each data point 

belongs to one of the K categories with probability p{xi = A:|7r) = vr^.. We assume a Dirichlet 
prior on tt: p(7r) oc The posterior distribution is p(7r|x) oc 

where d{xi = k) counts the points Xi in category k. Denote n = [ni, • • • 

and n := |n| = Tor inference, we need to sample from the posterior distribution 

p(7r|x) dehned on the probability simplex. 


the sampling algorithms 
Patterson and Teh (2013 
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Figure 6: Dirichlet-Multinomial model: probability estimates (left) and autocorrelation 
function for MCMC samples (right). 


The Fisher information matrix is a function of 7r_x (here, ‘—A'’ means all but the AT-th 
components) and is calculated as follows: 

GF(7r_i^) = -E[V^logp(x|7r_K)] 

= -E[V^(nI^ log{7v_K) + (n - nl^^l) log(l - vrl^l))] 

= -E[V{n_K/'^-K - l(n - nl^l)/(l - (43) 

= -E[- diag(n_i^/7r?.^) - ll'^(n - nl^l)/(1 - Trl^l)^] 

= n[diag(l/7r_ii') + 


Now we use I—)■ 0 = y/iv to map the simplex to the sphere (the first orthant). 

Note that = 2diag(0_R')- Therefore, we have a proper metric on \/A^ as follows: 

du_K 




d^-K 

dO-K 


GF(7r_ic') 


d-i^-K 

dOlK 


4n[lR'_i + O^kO-k'^= AnGs^O) 


(44) 


where the scalar 4n properly scales the metric in high dimensional data intensive models. 
In EDA particularly, n could be the number of words counted in the selected documents. 
Hence, we use Gy^(0) instead of G5^(0). We refer to the resulting method as Spherical 
Lagrangian Monte Carlo (SphLMC). 

Recall that in the development of Spherical HMC algorithms, we decided to omit the log 


volume adjustment term, log 


d-Px) 

dOs 


, in the partial Hamiltonian (29) and (37), and regard it 


as the weight to adjust the estimate of (21) or resample. This is not feasible if the EDA 
model is going to be used in an online setting. Therefore, we use (/)(0) in (22), as opposed 
to U(0) to avoid the re-weighting step. 
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Lan and Shahbaba 


To illustrate our proposed method, we consider the toy example discussed above. For 
this problem, Patterson and Teh (2013) propose a Riemannian Langevin Dynamics (RLD) 
method, but use an expanded-mean parametrization to map the simplex to the whole space. 
As mentioned above, this approach (i.e., expanding the parameter space) might not be ef¬ 
ficient in general. This is illustrated in Figure Here, we set a = 0.5 and run RMW, 
WallHMC, RLD, and SphLMCj^for 1.1 x 10® iterations; we discard the first 10^ samples. 
As we can see in Figure compared to alternative algorithms, our SphLMC method pro¬ 
vides better probability estimates (left panel). Further, SphLMC generates samples with a 
substantially lower autocorrelation (right panel). 


5. Experimental results 


In this section, we evaluate our proposed methods using simulated and real data. To 


this end, we compare their efficiency to that of RWM, Wall HMC, exact HMC (Pakman 


and Paninski 


2013[ ), and the Riemannian Langevin dynamics (RLD) algorithm proposed 
by [Patterson and Teh| (2013) for LDA. We define efficiency in terms of time-normalized 
effective sample size (ESS). Given N MCMC samples, for each parameter, we dehne ESS = 
+ , where p{k) is sample autocorrelation with lag k (Geyer, 1992). We use 


the minimum ESS normalized by the CPU time, s (in seconds), as the overall measure of 
efficiency: min(ESS)/s. All computer codes are available online at http://www.ics.uci. 
edu/-s1an/SphHMC, 


5.1 Truncated Multivariate Gaussian 

Eor illustration purpose, we start with a truncated bivariate Gaussian distribution, 



~ M 



1 

.5 



0 < < 5, 0 < /32 < 1 


This is box type constraint with the lower and upper limits as 1 = (0,0) and u = (5,1) 
respectively. The original rectangle domain can be mapped to 2d unit disc to use 

c-SphHMC, or mapped to 2d rectangle TZq where s-SphHMC can be directly applied. 

The upper leftmost panel of Figure shows the heatmap based on the exact density 
function, and the other panels show the corresponding heatmaps based on MCMC samples 
from RWM, Wall HMC, exact HMC, c-SphHMC and s-SphHMC respectively. Table[^com- 
pares the true mean and covariance of the above truncated bivariate Gaussian distribution 
with the point estimates using 2 x 10® (2 x 10“^ for each of 10 repeated experiments with 
different random seeds) MCMC samples in each method. Overall, all methods estimate the 
mean and covariance reasonably well. 

To evaluate the efficiency of the above-mentioned methods, we repeat this experiment 
for higher dimensions, D = 10, and D = 100. As before, we set the mean to zero and set 
the (i, j)-th element of the covariance matrix to Sij = 1/(1 + \i — j\). Further, we impose 


2. Note, the natural gradient in ( |34[ | to update v is 
O.5)/0 - 0 * |n a - 0.5|]/(2n). 


T-k-1 

-OIkIOk^ 




[(n -I- a — 
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Truth Estimate by RWM Estimate by Wall HMC 



Figure 7: Density plots of a truncated bivariate Gaussian using exact density function 
(upper leftmost) and MCMC samples from RWM, Wall HMC, exact HMC, c- 
SphHMC and s-SpliHMC respectively. 


Method 

Mean 

Covariance 

Truth 

0.7906 

0.4889 


0.3269 0.0172 

0.0172 0.08 

RWM 


0.7796 ± 0.0088 

0.4889 ± 0.0034 



0.3214 ±0.009 0.0158 ± 0.001 ■ 

0.0158 ± 0.001 0.0798 ± 5e - 04 


Wall HMC 


0.7875 ± 0.0049 
0.4884 ± 8e - 04 



0.3242 ± 0.0043 0.017 ±0.001 

0.017 ±0.001 0.08±3e-04 


exact HMC 


0.7909 ± 0.0025 
0.4885 ±0.001 



0.3272 ± 0.0026 0.0174 ± 7e - 04 

0.0174 ± 7e - 04 0.08 ± 3e - 04 


c-SphHMC 


0.79 ±0.005 

0.4864 ± 0.0016 



0.3249 ± 0.0045 0.0172 ± 0.0012 

0.0172 ± 0.0012 0.0801 ± 0.001 


s-SphHMC 


0.7935 ± 0.0093 
0.4852 ± 0.003 



0.3233 ± 0.0062 0.0202 ± 0.0018 

0.0202 ± 0.0018 0.0791 ±9e-04 



Table 1: Comparing the point estimates for the mean and covariance of a bivariate trun¬ 
cated Gaussian distribution using RWM, Wall HMC, exact HMC, c-SphHMC and 
s-SphHMC. 


the following constraints on the parameters, 

0 < /3j < ttj 

where Ui (i.e., the upper bound) is set to 5 when i = 1; otherwise, it is set to 0.5. 
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For each method, we obtain 10^ MCMC samples after discarding the initial 10^ samples. 
We set the tuning parameters of algorithms such that their overall acceptance rates are 
within a reasonable range. As shown in Table[^ Spherical HMC algorithms are substantially 
more efficient than RWM and Wall HMC. For RWM, the proposed states are rejected about 
95% of times due to violation of the constraints. On average, Wall HMC bounces off the 
wall around 3.81 {L = 2) and 6.19 {L = 5) times per iteration for D = 10 and D = 100 
respectively. Exact HMC is quite efficient for relatively low dimensional truncated Gaussian 
[D = 10); however it becomes very slow for higher dimensions {D = 100). In contrast, 
by augmenting the parameter space. Spherical HMC algorithms handle the constraints 
in a more efficient way. Since s-SphHMC is more suited for box type constraints, it is 
substantially more efficient than c-SphHMC in this example. 


Dim 

Method 

AP 

s/iter 

ESS (min,med,max) 

Min(ESS)/s 

spdup 


RWM 

0.62 

5.72E-05 

(48,691,736) 

7.58 

1.00 


Wall HMC 

0.83 

1.19E-04 

(31904,86275,87311) 

2441.72 

322.33 

D= 10 

exact HMC 

1.00 

7.60E-05 

(le-h05,le-h05,le-F05) 

11960.29 

1578.87 


c-SphHMC 

0.82 

2.53E-04 

(62658,85570,86295) 

2253.32 

297.46 


s-SphHMC 

0.79 

2.02E-04 

(76088,le-h05,le-F05) 

3429.56 

452.73 


RWM 

0.81 

5.45E-04 

(1,4,54) 

0.01 

1.00 


Wall HMC 

0.74 

2.23E-03 

(17777,52909,55713) 

72.45 

5130.21 

D= 100 

exact HMC 

1.00 

4.65E-02 

(97963,le-h05,le-F05) 

19.16 

1356.64 


c-SphHMC 

0.73 

3.45E-03 

(55667,68585,72850) 

146.75 

10390.94 


s-SphHMC 

0.87 

2.30E-03 

(74476,99670,le-F05) 

294.31 

20839.43 


Table 2: Comparing the efficiency of RWM, Wall HMC, exact HMC, c-SphHMC and s- 
SphHMC in terms of sampling from truncated Gaussian distributions. AP is 
acceptance probability, s/iter is seconds per iteration, ESS (min,med,max) is the 
(minimum, median, maximum) effective sample size, and Min(ESS)/s is the time- 
normalized minimum ESS. 


5.2 Bayesian Lasso 


In regression analysis, overly complex models tend to overfit the data. Regularized regres¬ 
sion models control complexity by imposing a penalty on model parameters. By far, the 
most popular model in this group is Lasso (least absolute shrinkage and selection operator) 


proposed by Tibshirani (1996). In this approach, the coefficients are obtained by minimizing 


the residual sum of squares (RSS) subject to a constraint on the magnitude of regression 
coefficients, 

RSS(/3) ■.= Y^{y,-i3o-xJ(3f (45) 


min RSS(/3), 


One could estimate the parameters by solving the following optimization problem: 


min RSS(/3) -|- 

13 ,\ 


( 46 ) 
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Bayesian Lasso 
Gibbs Sampier 



Bayesian Lasso 
Waii HMC 



Bayesian Lasso 
Sphericai HMC 



Figure 8: Bayesian Lasso using three different sampling algorithms: Gibbs sampler (left), 
Wall HMC (middle) and Spherical HMC (right). 


where A > 0 is the regularization parameter. Park and Casella (2008) and Hans ( 2009| ) 
have proposed a Bayesian alternative method, called Bayesian Lasso, where the penalty 
term is replaced by a prior distribution of the form H(/3) oc exp(—A|/3|), which can be repre¬ 
sented as a scale mixture of normal distributions (West, 1987). This leads to a hierarchical 


Bayesian model with full conditional conjugacy; therefore, the Gibbs sampler can be used 
for inference. 


Our proposed spherical augmentation in this paper can directly handle the constraints 
in Lasso models. That is, we can conveniently use Gaussian priors for model parameters, 
/3|cr2 J\f {0, a‘^1), and let the sampler automatically handle the constraint. In particular, 
c-SphHMC can be used to sample posterior distribution of /? with the 1-norm constraint. 
For this problem, we modify the Wall HMC algorithm, which was originally proposed for 
box type constraints (Neal 2011). See Appendixfor more details. 


We evaluate our method based on the diabetes data set (N=442, D=10) discussed in 
Park and Casella ( 2008[). Figure [^compares coefficient estimates given by the Gibbs sampler 
(Park and Casella, |2008[) , Wall HMC, and Spherical HMC respectively as the shrinkage 
factor s := changes from 0 to 1. Here, denotes the estimates 

obtained by ordinary least squares (OLS) regression. For the Gibbs sampler, we choose 
different A so that the corresponding shrinkage factor s varies from 0 to 1. For Wall HMC 
and Spherical HMC, we fix the number of leapfrog steps to 10 and set the trajectory length 
such that they both have comparable acceptance rates around 70%. 


Figure [^compares the sampling efficiency of these three methods. As we impose tighter 
constraints (i.e., lower shrinkage factors s). Spherical HMC becomes substantially more 
efficient than the Gibbs sampler and Wall HMC. 
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Figure 9: Sampling efficiency of different algorithms for Bayesian Lasso based on the dia¬ 
betes dataset. 


5.3 Bridge regression 

The Lasso model discussed in the previous section is in fact a member of a family of regres¬ 


sion models called Bridge regression (Frank and Friedman, 1993), where the coefficients are 


obtained by minimizing the residual sum of squares subject to a constraint on the magnitude 
of regression coefficients as follows: 


min RSS(/3), RSS(/3) := - /3o - xj/3) 


(47) 


For Lasso, q = 1, which allows the model to force some of the coefficients to become exactly 
zero (i.e., become excluded from the model). When q = 2, this model is known as ridge 
regression. Bridge regression is more flexible by allowing different q norm constraints for 


different effects on shrinking the magnitude of parameters (See Figure 10) 


While the Gibbs sampler method of Park and Casella (2008) and Hans (2009) is limited 


to Lasso, our approach can be applied to all bridge regression models with different q. 
To handle the general g-norm constraint, one can map the constrained domain to the unit 
ball by (13) and apply c-SphHMC. Figure [To| compares the parameter estimates of Bayesian 
Lasso to the estimates obtained from two Bridge regression models with = 1.2 and q = 0.8 
for the diabetes dataset ( Park and Casella] |2008 ) using our Spherical HMC algorithm. As 
expected, tighter constraints (e.g., q = 0.8) would lead to faster shrinkage of regression 
parameters as we decrease s. 

5.4 Reconstruction of quantized stationary Gaussian process 

We now investigate the example of reconstructing quantized stationary Gaussian process 


discussed in Pakman and Paninski] (2013). Suppose we are given N values of a function 
f{xi),i = I,-- - ,N, which takes discrete values from {qk}k=i- We assume that this is a 
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Figure 10: Bayesian Bridge Regression by Spherical HMC: Lasso (q=l, left), q=1.2 (mid¬ 
dle), and q= 0.8 (right). 


quantized projection of a sample y{xi) from a stationary Gaussian process with a known 
translation-invariant covariance kernel of the form Eij = K{\xi — Xj\), and the quantization 
follows a known rule of the form 


f{xi) = Qk, if Zk < y{xi) < Zk+i (48) 

The objective is to sample from the posterior distribution 

p{y{xi),--- ,y(xAr)|/(xi), • • • , /(xat)) ~ AA(0,17) trunctated by rule (|^ (49) 

In this example, the function is sampled from a Gaussian process with the following kernel 


K{\xi — Xj\) = cr^ exp < — 


Xi — Xi 


2r/2 


0-2 = 0 . 6 , ri^ = 0.2 


We sample N = 100 points of {y{xi)} and quantize them with 


qi = —0.75, q 2 = —0.25, qs = 0.25, q 4 = 0.75, zi = —oo, Z 2 = —0.5, z^ = 0, Z 4 = 0.5, Z 5 = -|-oo 

This example involves two types of constraints: box type (two sided) constraints and one 
sided constraints. In implementing our Spherical HMG algorithms, we transform the sub¬ 
space formed by components with both finite lower and upper limits into unit ball and map 
the subspace formed by components with one sided constraints to the whole space using 
absolute value (discussed at the end of Section . 

Figure [IT] shows the quantized Gaussian process (upper) and the estimates (lower) with 
10^ samples given by different MCMC algorithms. Overall, all the methods recover the 
truth well. Table summarizes the efficiency of sampling 1.1 x 10^ and burning the hrst 
10^ with RWM, Wall HMC, exact HMC, c-SphHMC and s-SphHMC. Exact HMC generates 
more effective samples but takes much longer time even though implemented in C. Spherical 
HMC algorithms outperform it in terms of time normalized ESS. Interestingly, Wall HMC 
performs well in this example, even better than exact HMC and c-SphHMC. 
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Figure 11: Quantized stationary Gaussian process (upper) and the estimates of the process 
(lower). 


Method 

AP 

s/iter 

ES S (min, med, max) 

Min(ESS)/s 

spdup 

RWM 

0.70 

7.11E-05 

(2,9,35) 

0.22 

1.00 

Wall HMC 

0.69 

9.94E-04 

(12564,24317,43876) 

114.92 

534.48 

exact HMC 

1.00 

l.OOE-02 

(72074,le-h05,le-h05) 

65.31 

303.76 

c-SphHMC 

0.72 

1.73E-03 

(13029,26021,56445) 

68.44 

318.32 

s-SphHMC 

0.80 

1.09E-03 

(14422,31182,81948) 

120.59 

560.86 


Table 3; Comparing efficiency of RWM, Wall HMC, exact HMC, c-SphHMC and s-SphHMC 
in reconstructing a quantized stationary Gaussian process. AP is acceptance 
probability, s/iter is seconds per iteration, ESS (min,med,max) is the (mini¬ 
mal,median,maximal) effective sample size, and Min(ESS)/s is the minimal ESS 
per second. 


5.5 LDA on Wikipedia corpus 

LDA (]Blei et ab 2003) is a popular hierarchical Bayesian model for topic modeling. The 


model consists of K topics with probabilities {vr^} drawn from a symmetric Dirichlet prior 
Dir(/3). A document d is modeled by a mixture of topics, with mixing proportions r]d ~ 
Dir(a). Document d is assumed to be generated by i.i.d. sampling of a topic assignment, 
Zdi, from rjd for each word Wdi in the document, and then drawing the word Wdi from the 
assigned topic with probability tt^^- ( ]Patterson and Teh 2013). Teh et al. (2006) integrate 
out rj analytically to obtain the following semi-collapsed distribution: 


D 


p{w,z,-K\a,^) = 


T{Ka) 


Li ^(Aa Ud..) r(a) 


K 

n 


r(a + n.o,-) A r(WI3) 


W 


n ± (yv -r-r 

r(^)w ii 


k=l 


TT' 


d+n.i 

kw 


0-1 
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Figure 12: Test-set perplexity and computation time (in log scale) based on the Wikipedia 
corpus. 


where Udkw = = w,Zdi = k). Here, denotes the summation over the corre¬ 

sponding index. Given tt, the documents are i.i.d so the above equation can be factorized 
as follows (jPatterson and Teh 2013): 


D 


K 


p{w, z, 7r\a, /3) = p{tt\/3) p{wd, Zd\a, vr), p{wd, Zd\a, tt) = ]^ 


d=i 


k=l 


r{a + ndk- 

T{a) 


w 

n 

w=l 


TT 


■'^dk'u 

kw 


(51) 


To evaluate our proposed methods, we compare them with the state-of-the-art method of 
Patterson and Teh (2013). Their approach, called stochastic gradient Riemannian Langevin 


dynamics (sg-RLD) is an extension of the stochastic gradient Langevin dynamics (SGLD) 


proposed by Welling and Teh (2011). Because this approach uses mini-batches of data to 


approximate the gradient and omits the accept/reject step of Metropolis-Hastings while 
decreasing the step size, we follow the same procedure to make our methods compara¬ 
ble. Further, because Langevin dynamics can be regarded as a single step Hamiltonian 
dynamics (Neal, 2011), we set L = 1. We refer the resulting algorithms as sg-SphHMC 


and sg-SphLMG, which are modified versions of our SphHMC and SphLMC algorithms. 
sg-SphLMG uses the following stochastic (natural) gradient (gradient preconditioned with 
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metric) 


Qkn, = [(nL+/3-l/2)/0fc.+0fc.K.+W(/3-l/2))]/(2*nt), ^ 


d£Dt 


E 


Zd\wd,d,a 


[iT'dkw] 

(52) 

where 1/2 comes from the logarithm of volume adjustment. In contrast, the stochastic 
gradient for sg-SphHMC is ^gkwn%. (See Section 4.3). The expectation in Equation (52) 


is calculated using Gibbs sampling on the topic assignment in each document separately, 


given the conditional distributions (Patterson and Teh, 2013) 


( u\ a \ (® + 

piyZdi = k\wd., 0, a) = -rr- 


(53) 


where \i means a count excluding the topic assignment variable currently being updated. 
Step size is decreased according to e* = a(l + t/6)“'^. 

We use perplexity (Patterson and Teh, 2013[ [Wallach et al. 2009) to compare the 
predictive performance of different methods in terms of the probability they assign to unseen 
data, 


nd-. 


peTp{wd\W,a,P) = exp<^ - '^\ogp{wdi\W P)/nd.. >, p{wdi\W (3) = Er?rf,7r[y~l PdkT^kwd^] 


2=1 


(54) 

where W is the training set and Wd is the hold-out sample. More specifically, we use the 
document completion approach (Wallach et al., 2009), which partitions the test document 


Wd into two sets, and we then use to estimate Ud for the test document 

and use to calculate perplexity. 

We train the model online using 50000 documents randomly downloaded from Wikipedia 
with the vocabulary of approximately 8000 words created from Project Gutenburg texts 
([Hoffman et al.[ 2010). The perplexity is evaluated on 1000 held-out documents. A mini¬ 


batch of 50 documents is used for updating the natural gradient for 4 algorithms: sg-RLD, 
sg-wallLMC0 sg-SphHMCand sg-SphLMC. 

Figure [T^ compares the above methods in terms of their perplexities. For each method, 
we show the best performance over different settings (Settings for best performance are 
listed in Table [^). Both sg-wallLMG and sg-SphLMG have lower perplexity than sg-RLD 
at early stage, when relatively a small number of documents are used for training; as the 
number of training documents increases, the methods reach the same level of performance. 
As expected, sg-SphHMC does not perform well due to the absence of a proper scaling 
provided by the Fisher metric. 


6. Discussion 

We have introduced a new approach, spherical augmentation, for sampling from constrained 
probability distributions. This method maps the constrained domain to a sphere in an 
augmented space. Sampling algorithms can freely explore the surface of sphere to generate 

3. The stochastic gradient for sg-wallLMC is [(n^„ + fi — 1/2) -|- TVkw{nl. +W{I3 — l/2))]/nL 
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Algorithm 

a 

b 

c 

a 

P 

K 

Gibbs samples 

sg-RLD 

O.OI 

1000 

0.6 

0.01 

0.5000 

100 

100 

sg-wallLMC 

0.20 

1000 

2.0 

0.01 

0.5000 

100 

100 

sg-SphHMG 

0.01 

1000 

0.6 

0.01 

0.0100 

100 

100 

sg-SphLMC 

0.25 

1000 

1.5 

0.01 

0.5000 

100 

100 


Table 4: Parameter settings for best performance in Wikipedia experiment. 


samples that remain within the constrained domain when mapped back to the original space. 
This way, our proposed method provides a mathematically natural and computationally 
efficient framework that can be applied to a wide range of statistical inference problems 
with norm constraints. 

The augmentation approach proposed here is based on the change of variables theorem. 
We augment the original L)-dimensional space with one extra dimension by either inserting 
slack variables (c-SphHMC) or using embedding map (s-SphHMC), The augmented Hamil¬ 
tonian is the same under different representations (|30[)(|40[) due to the mathematical fact 
that the energy is invariant to the choice of coordinates (Proposition | A. ij ) . To account for 
the change of geometry, a volume adjustment term needs to be used, either as a weight after 
obtaining all the samples (SphHMC) or as an added term to the total energy (SphLMC). 

Our proposed method takes advantage of the splitting strategy to further improve com¬ 
putational efficiency. We split the Lagrangian dynamics and update velocity in the tangent 
space, rather than momentum in the cotangent space. This implementation avoids the 
requirement of embedding as in Byrne and Girolami (2013) and could be applied to more 
general situations. 

In developing Spherical HMC, we start with the standard HMC, using the Euclidean 
metric I on unit ball B^(l). Then, spherical geometry is introduced to handle constraints. 
One possible future direction could be to directly start with RHMC/LMC, which use a 
more informative metric (i.e., the Fisher metric Gp), and then incorporate the spherical 
geometry for the constraints. For example, a possible metric for the augmented space could 
be G-p+ 06^ However, under such a metric, we might not be able to find the geodesic 
fiow analytically, which could undermine the added benefit from using the Fisher metric. 

In future, we also intend to explore the possibility of applying the spherical augmentation 
to Elliptical Slice sampler (Murray et ah, 2010) in order to generalize it to Spherical Slice 
sampler (SSS). The resulting algorithm can be applied to truncated Gaussian process mod¬ 
els. In general, we can extend our proposed methods to infinite dimensional function spaces. 
This would involve the infinite dimensional manifold S°° := {/ G L^(H)| f = I}. In 
this setting it is crucial to ensure that the acceptance probability does not drop quickly as 
dimension increases (]Beskos et al. 


2011 ). 


27 
























Lan and Shahbaba 


Appendix A. Spherical Geometry 

We first discuss the geometry of the Ii)-dimensional sphere := {0 G : || 0||2 = 

1} ^ under different coordinate systems, namely, the Cartesian coordinate and the 

spherical coordinate. Since can be embedded (injectively and differentiably mapped to) 
in we first introduce the concept of ‘induced metric’. 

Definition 1 (indnced metric) If can be embedded to {m > d) by f : U C V ^ 
M, then one can define the induced metric, gx>, on TV through the metric gM defined on 
TM: 

9vi0)iu,v) = gMifiO))idf0{u),df0{v)), u,v€T0V (55) 

Remark 1 For any f : U C ^ we can define the induced metric through dot 

product on . More specifically, 

gs{u, v) = [{Df)u]^{Df)v = u^[{Dff {Df)]'v (56) 

where (D/)(£)_,_]^)xD ihe Jacobian matrix of the mapping f. A Metric induced from dot 
product on Euclidean space is called a “canonical metric”. This observation leads to the 
following simple fact that lays down the foundation of Spherical HMC. 

Proposition A.l (Energy invariance) Kinetic energy ^(v, v)q( 0 ) is invariant to the 
choice of coordinate systems. 

Proof For any v G T 0 V, suppose 9{t) such that 0(0) = v. Denote the pushforward of v 
by embedding map / : P —)■ At as v := /*(v) = ^(/ o 0)(O). Then we have 

^(v,v)g( 0 ) = ^5m(/(0))(v,v) (57) 

That is, regardless of the form of the energy under a coordinate system, its value is the 
same as the one in the embedded manifold. In particular, when A4 = the right hand 

side simplifies to gllvUl. ■ 


A.l Canonical metric in the Cartesian coordinate 


Now consider the D-dimensional ball . 6 ^( 1 ) := {6 G : || 0||2 < !}• Here, { 6 , Bq { 1 )} can 
be viewed as the Cartesian coordinate system for S^. The coordinate mapping Tq^s+ ■ 
6 ^ 6 = {6, 9d+i) in Q can be viewed as the embedding map into and the Jacobian 


matrix of Tj3^s+ is dTj3^s+ = = 

in the Cartesian coordinate, G5^(0), is 


Id 

-O'^IOd+i 


Therefore the canonical metric of 


Gsci^) = dTB^s+ = Id + 


ee^ 

^D+l 


= Id + 


ee^ 




(58) 
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Another way to obtain the metric is through the first fundamental form ds^ (i-e., squared 
infinitesimal length of a curve) for 5^, which can be expressed in terms of the differential 
form dO and the canonical metric 65 ^( 0 ), 

ds^ = {de,de)Gs^ = de'^GsA^)de 

On the other hand, ds"^ can also be obtained as follows (Spivak, 1979): 


ds'^ = dof = + idiOD+i{o))f = de'^do + = d6>^[i + ee'^i9Y^]d9 

1 ^ o 

i=l i=l " 


(Spivak, 1979 


Equating the above two quantities yields the form of the canonical metric 65 ^( 0 ) as in 
Equation (58). This viewpoint provides a natural way to explain the length of tangent 
vector. Eor any vector v = {v,vd+i) £ = {v G : 0 v = 0}, one could think of 

Gsc(9) as a mean to express the length of v in terms of v, 


v'^GsJ9)v = 


+ 


v'^do'^v 

‘^D +1 



( — ^D+lVP+l)^ 


(92 

'^D +1 




(59) 


This indeed verifies the energy invariance Proposition |A.l 

The following proposition provides the analytic forms of the determinant and the inverse 

of GsA9). 


Proposition A.2 The determinant and the inverse of the canonical metric are as follows 


\GsM\ = OnY GsSOy^ = 1d- 00^ ( 60 ) 


Proof The determinant of the canonical metric 65 ^( 0 ) is given by the matrix determinant 
lemma, 


\GsM\=det 


Id + 


66^ 

^D+l 


= 1 + 


e^e 

If 

^D+l 


1 


(92 

^D+l 


The inverse of 65 ^( 0 ) is obtained by the Sherman-Morrison-Woodbury formula (jGolub and 


Van Loan, 1996) 


GsM-' = 


Id + 


ee' 

dp+i 


1 -1 


= Id- 


00^/02 


D+l 


1 + 0 ^ 0/02 


= 1 ^- 00 ' 


D+l 


Corollary 1 


The volume adjustment of changing measure in ([^ is 


d0B 

d0s. 


GsA0)\--- 


\dD+l 


( 61 ) 
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Proof Canonical measure can be defined through the Riesz representation theorem by 
using a positive l inear functional on the space C o{S^) of compactly supported continuous 
functions on (Spivak 1979; do Carmo 1992). More precisely, there is a unique positive 


Borel measure /Tc such that for (any) coordinate chart {Bq {1),Tq^s. 





fCo)des^ 



fie)y/\GsM\d0B 


where fic = dOg^, and dOjg is the Euclidean measure. Therefore we have 


d^ 

dOs 


GsM\'^ = \dD+i\-^ 


Alternatively, 


dOs, 


\dD+ 


1|- 


A.2 Geodesic on a sphere in the Cartesian coordinate 

To find the geodesic on a sphere, we need to solve the following equations: 

0 = V 

V = -w'^TsX0)^ 


(62) 

(63) 


for which we need to calculate the Christoffel symbols, r5^(0), first. Note that the (i, j)- 
th element of is gij = 6ij + 9i9j and the {i,j,k)-th. element of dGs^ is gij^k = 

{dik9j + 9i6jk)l9‘jj_^_^ + 29i9j9k/9%j^^. Therefore 

^ij ~ 2^ [fl'bi* 9il,j ~ 9ij,l] 

= lid’‘-9^9^Wii9j+9i5ji)/0l^, + {5ij9i + 9Ai)/9l^^ - (Sadj + 9i6ji)/9l^, + 29i9j9i/9%^,] 
= {5’^^-9'^9%/9l^^[6i,+9i9,/9l^,] 

= 9k[5ij + 9i9j/9l)_^_i] = [G5^(0) ® 0]ijk 


Using these results, we can write Equation (63) as v = — v'''G5 ^(0 )v0 = — ||v||20. Further, 
we have 


dt 


dn+i =:7r\/l - ll^lli = “ 




9d+ 


e 


1 


= VD+l 


VD+l = 


d 

dt 9 d+i 


■ _ 2. 

+ W5-PD+l — —||V||2f7i?+l 


9d+ 


1 


^D+l 


Therefore, we can rewrite the geodesic equations (62) (63) with augmented components as 


e = v 

V = —||v ||20 


30 


(64) 

(65) 
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Multiplying both sides of Equation (65) by v 
system of differential equations as follows: 


to obtain ^||v| 


0 , we can solve the above 


e{t) = 0(O)cos(||v(O)||2t) + ||^^|^sin(||v(0)||2t) 
v(t) = -0(O)||v(O)||2sin(||v(O)||2t) + v(0)cos(||v(0)||2t) 


A.3 Round metric in the spherical coordinate 

Consider the D-dimensional hyper-rectangle TZq := [0,7r]^“^ x [0,27r) and the correspond¬ 
ing spherical coordinate system, {0,7^^}, for . The coordinate mapping Tti^^s '■ ^ 

X, Xd = cos{6d) nti sin(0j), d = 1, • • • , D -|- 1, (0d-i-i = 0) can be viewed as the embed¬ 
ding map into and the Jacobian matrix of is with the (d, j)-th element 

[— tan{9d)6dj + cot{6j)I{j < d)]xd- The induced metric of in the spherical coordinate is 
called round metric, denoted as 65 ^( 0 ), whose (i,j)-th element is as follows 


GsAOh 

D+l 

= ^ [-tan(0d)Jdi + coi{0j)I{i < d)][-ia\i{9d)5dj +coi{9j)I{j < d)]x^ 

d=i 

= tav?{9i)5ijx1 — tan(0i) col{9j)I{i > j)xj — iaa{9j) cot(0j)/(i < j)x‘j -|- cot(0i) cot{9j) ^ 

d>max{2 j} 

— taa{9j) cot(0i)x^ -|- cot(0j) cot(0j) x^ = 0, i < j 

d>j 

= < d-l 

tan^{9i)x‘f + cot^(6'i) ^Xrf = (tan^(6»j) -h l)x‘f = sin^(6'j), i = j 

, d>i i=l 

d-l 

= sm'^{9i)6ij 

i=l 

( 66 ) 

Therefore, G5,,(0) = diag[l, sin^(0i), ■ • • , 0?=!^ sin^(0d)]. Another way to obtain Gs^{6) is 
through the coordinate change: 


GsAO) = 


dOl 

dOs^ 


GsM 


dds. 

del 

Ot* 


Similar to Corollary 0 , we have 

Proposition A.3 The volume adjustment of changing measure in (10) is 

D-l 


dOrio 


dOs^ 


= |G 5 ,( 0 )| ^ = 0 sin 


-{D-d) 


{Od) 


d=l 


(67) 


( 68 ) 


Appendix B. Jacobian of the transformation between g-norm domains 

The following proposition gives the weights needed for the transformation from to 

^f(l)- 
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Proposition B.l The Jacobian determinant (weight) o/Tq^q is as follows: 

\‘lTs^e\ = [-) (^ni«<lj 

Proof Note 

Tb^q: 0^/3 = sgn(0)|0|2/'? 

The Jacobian matrix for is 

^ = -diag(|6»|2/9-i) 
de^ q ^ 

Therefore the Jacobian determinant of is 

djJ 


\dTB^Q\ = 


de' 


D / D \ 2 / 9-1 


ni«< 

u=i 


(69) 


The following proposition gives the weights needed for the change of domains from TZ^ 
to 


Proposition B.2 The Jacobian determinant (weight) ofTB^n is as follows.■ 

IdTB^Tll = 

Proof First, we note 


2 


Tb^ti = Tc^tz o Tb^c ■ 0 = 6 1-^ /3 = ^ ( 3 ' H-^ 


e\ 


The corresponding Jacobian matrices are 

d(3' \\0h 


Tb^c '■ 


dG^ 


\e\ 


i + e 


u-l 


flT 101 

C' arg max 1 6 \ 


1011^ e 


arg max \ 0 \ 


(70) 


where e. 


Tc^ti : — T = diag ( 

d{pf V 2 

argmax|0| is & vector with (argmax|0|)-th element 1 and all others 0. Therefore, 


{dTB^nl = \dTc^Ti \ \dTB^c\ = 


d(3 

df3' 

\\o\\D ^ / 

11^112 Y~[ 

diPf 

dO'^ 

~ \\f)\\D li 2 

II Iloo 
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Appendix C. Splitting Hamiltonian (Lagrangian) dynamics on S 


D 


Splitting the Hamiltonian dynamics and its usefulness in improving HMC is a well-studied 
topic of research (Leimkuhler and Reich, 2004 Shahbaba et al.[ 2014; Byrne and Girolami 


2013). Splitting the Lagrangian dynamics (used in our approach), on the other hand, has 


not been discussed in the literature, to the best of our knowledge. Therefore, we prove the 
validity of our splitting method by starting with the well-understood method of splitting 


Hamiltonian (Byrne and Girolami, 2013), 


H*ie, p) = ^uie) + + lui9) 


The corresponding systems of differential equations, 


0=0 

P = -l^eUiO) 


can be written in terms of Lagrangian dynamics in (0, v) as follows: 


0 = GsM'^P 

P = -\p'^GsM-^dGsMGsM~^P 


0=0 


1 


V = --GsM~"^eU{e) 


0 = V 

V = -v''’r5^(0)v 


We have solved the second dynamics (on the right) in Section A.2 To solve the first dy¬ 
namics, we note that 


d 


0^ 




= 0 


d 0'^v 


VD+l = 


0^v + 0^v 


dt 6 d+i 
Therefore, we have 

m = m 


m = v(o) - - 


9d+i 




0Tv 


02 ^D+1 

^D+1 ^ 


1 0 ' 


7D+1 


-GsSOr^^eUie) 


0{oy 

6d+i{0) 


[I - 0(O)0(O)^]V0C/(0) 


where 


ojoy 

Sd+i{0) ^ 


[I-0(O)0(O)T] = 


I - 0(O)0(O)'^ 


I 

-0D+i(O)0(O)'"_ 




-0(O)0(O)^. 


Finally, we note that ||0(t)||2 = 1 if ||0(O)||2 = 1 and v(t) G if v(0) G . 


Appendix D. Error analysis of Spherical HMC 


Following Leimkuhler and Reich (2004), we now show that the discretization error = 
||z(t„) — = ||(0(f„), v(t„)) — (0''"''’, v^"-))!! (i.e. the difference between the true solution 

and the numerical solution) is 0 {s^) locally and C(e^) globally, where e is the discretization 
step size. Here, we assume that f(0, v) := v'''r(0)v -|- G{ 6 )~^'V 0 U{O) is smooth; hence, f 
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and its derivatives are uniformly bounded as z = {0, v) evolves within finite time duration 
T. We expand the true solution z(t„+i) at tn- 

z{tn+l) = z{tn) + z{tn)e + + 0(6^) 

^ Oitn) 

v(t„) 


+ 


v(tn) 

-f(0(tn), v(in)) 


1 

"+2 


-f(0(t„), v(tn)) 


+ O(e^) 




We first consider Spherical HMC in the Cartesian coordinate, where f(0, v) = ||v|p0 + [I — 
96'^]VgU{9). From Equation (34) we have 


^(n+l/2) ^ ^(n) _ 

~(n+l/2) ||2 ^ II ~(n) ||2 _ e V gU {9^^^) + 0{e^) 


(74) 


Now we expand Equation (35) using Taylor series as follows: 


Q{n+ 1 ) ^ Q(n) 1 ^ _ II ~ (n+ 1 / 2 ) || 2 ^ 2/2 + 0(e^)] + - || v(^+V2) || 2 £ 2 / 3 , ^ 

^(rx+3/4) ^ _Q{n) || ~ (n+1/2) ||2^j^ _ || ~ (n+1/2) ||2g2/3, ^ ^ ^{n+1/2) _ || ~ (n+1/2) ||2g2/2 + 


Substituting (74) in the above equations yields 


0 (n+l) ^ Qin) ^ ^(n+ 1 / 2 )^ _ ^(n) || ~ (n+1/2) || 2 g 2/2 + 0 {e^) 

= + 0(e3) 

^(n+3/4) _ ^(n+1/2) _ g(n) ||^(n+l/2) ||2^ _ ^(n+1/2) ||^(n+l/2) ||2^2 _|_ 0(£^) 

= - [(I - 9^^\9^^^f)VgU{9^^^)/2 + ]e 

+ [6»W(v(”))’^V 0C/(6>(’^)) - v(”)||v(”)||V2]e^ + ©(e^) 


With the above results, we have 

v(»^+l) =^{n+3/4) _ £^j _ ^(n+l)^^{n+l)^T^y^^^^(n+l)|j 

_ 1[(I _ + v(’^)(6»(”))^)V0C/(6»(^))]e2 + O(s^) 

=vW - f(6»(’^), v("))e - v(’"))e2 + 0(£^) 
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where for the last equality we need to show 0 U{ 6 ^^'^) = —2^|p. This can be 

proved as follows: 

^l|vf = ^[||vf+ n|5+i] = 2[-v^f + RB+ihD+i] 


= 2 


T-> / 0 


-v'f+ 


n + h2- ^D+1 VD+1 

t'D+l ^D +1 J 


= -2 

= -2 

= -2 
Therefore we have 


v-:A±i(,Vf+:A±i||vn2 


Gd +1 

VD+l 


Gd+i 


vT0_^(||0||2_l) 

f^D+l 


iviP + (v-^ 0 ) [i-ee']VeU{e)] 


^r'VeU{e)+{-^r'e-'^{l-\\ef)]e'VeU{e) 


e 


D+l 


'Q{n+iy 


■0W 



1 

■_f(6)W^v(’^))' 

v("'+l) 


vH 

-1- 

_-f(0(”),vW)_ 

"+2 

_-f(0("),vW)_ 


= -2v^V0C/(0) 


+ 0{e^) (75) 


The local error is 

en+l = ||z(t„+i) - 


e{tn) - 

Lv(tO - v(-). 


+ 


v(t„) - v(-) 


1 

"+2 


-[f(t„)-fW] ^2 


-[f(t„) - fW]_ 


e" + 0 {e^) 


(76) 


< (1 + M\£ + M 2 £^)(in + 


where Mk = Cfcsnp^g[o,r] A; = 1,2 for some constants Ck > 0. Accumu¬ 

lating the local errors by iterating the above inequality for L = T/e steps provides the 
following global error: 

ei+i < (1 + Mi£ + M 2 £^)eL + 0{e'^) < (1 + Mie + M 2 £^)‘^eL-i + 30{£^) < ■■■ 


< (1 + Mie + M 2 £^rei + LO{e^) < + T)e^ ^0, as £ ^ 0 

For Spherical HMC in the spherical coordinate, we conjecture that the integrator of 
Algorithm still has order 3 local error and order 2 global error. One can follow the same 
argument as above to verify this. 


Appendix E. Bounce in diamond: Wall HMC for 1-norm constraint 


Neal (2011) discusses the Wall HMC method for oo-norm constraint only. We can however 


derive a similar approach for 1-norm constraint. As shown in the left panel of Figure 13 


given the current state Oq, HMC makes a proposal 6 . It will hit the boundary to move from 
9q towards 6 . To determine the hit point ‘X’, we are required to solve for t G (0,1) such 
that 


D 

|0o + {0- 9o)t\\, = Y,\Gt + {G^ - Gt)t\ = 1 

d=l 


( 78 ) 
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Figure 13: Wall HMC bounces in the 1-norm constraint domain. Left: given the current 
state Oq, Wall HMC proposes 6 , but bounces of the boundary and reaches 6 ' 
instead. Right: determining the hitting time by monitoring the first intersection 
point with coordinate that violates the constraint. 


One can find the hitting time using the bisection method. However, a more efficient method 
is to find the orthant in which the sampler hits the boundary, i.e., find the normal direction 
n with elements being ±1. Then, we can find t, 

11^0 + {0 - do)t\\i = n'''[0o + {0 - 9o)t] = 1 => t* = j —^ (79) 

n'(t/ - t/oj 

Therefore the hit point is 6 q = Oq + (6 — 6 Q)t* and consequently the reflection point is 

0' = 0 - 2n* (n*, 0 - 0'o) = 6» - 2n(n'^0 -l)/D (80) 

where n* := n/||n ||2 and n'''0g = 1 because 0 q is on the boundary with the normal direction 
* 

n . 

It is in general difficult to directly determine the intersection of 0 — 0o with boundary. 
Instead, we can find its intersections with coordinate planes {T^d}d=i^ where '■= {0 G 
= 0}. The intersection times are defined as T = {9q/{9q — 9‘^)\9q / 9'^}. We keep 
those between 0 and 1 and sort them in ascending order (Figure [T^ right panel). Then, we 
find the intersection points {0^ := 0o + (0 — 0o)7fc} that violate the constraint ||0|| < 1. 
Denote the first intersection point outside the constrained domain as 0^. The signs of 0^ 
and 0fc-i determine the orthant of the hitting point 0g. 

Note, for each d G {1,---D}, (sign(0^),sign(0^_;^)) cannot be (+, —) or (—,-|-), oth¬ 
erwise there exists an intersection point 0 * := 0o -|- (0 — 9o)T* with some coordinate 
plane between 0^ and 9k-i- Then Tfc-i < T* < Tk contradicts the order of T. 
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Algorithm 3 Wall HMC for 1-norm constraint (Wall HMC) 

Initialize at the current state 6 after transformation 
Sample a new velocity value ~ AA(0, Id) 

Calculate 

for ^ = 1 to L do 

v(^+|) = vW - fV0l7(0(^)) 

0 (^+ 1 ) = + £v(^+i) 

set hit false 
while ||0^^^|| > 1 do 

find coordinate plane intersecting times: T = {T^ := ^ 

sort those between 0 and 1 in ascending order: T = {0 < '\< 1} 

find the first point in { 0 ^ := 0 ^^^ -|- — 6^^^)Tk} that violates || 0 || < 1 and 

denote it as 6k 

set normal direction as n = sign(sign( 0 fc) -|- sign( 0 fc_i)) 
find the wall hitting time t* = (1 — n''' 0 ^^^)/(n'''( 0 ^^“'"^^ — 0 ^^^)) 

0 W ^ + ( 0 (^+ 1 ) _ and 0 (^+ 1 ) ^ 0 ^^+^^ - 2 n(n, 0 (^+i) - 0 W)/||n||| 

set hit ^ true 

end while 
if hit then 

v(^+i) ^ ( 0 (^+ 1 ) - 0W)||v(^+5)||/||0(^+l) - 0 W|| 

end if 

vC+i) ^ _ |V 0 [/( 0 (^+^)) 

end for 

Calculate = t/(0(^+^)) + A(v(-^+i)) 

Calculate the acceptance probability a = min{l,exp[—+ v^^))]} 

Accept or reject the proposal according to a for the next state O' 


1^ Therefore any point (including 0g) between 0^ and 0fc_i must have the same sign as 
sign(sign(0fc) sign(0fc_i)); that is 

n = sign(sign(0fc) -t- sign(0fc-i)) (81) 

After moving from 6 to O', we examine whether O' satisfies the constraint. If it does not 
satisfy the constraint, we repeat above procedure with 0o O'q and 0^0' until the final 
state is inside the constrained domain. Then we adjust the velocity direction by 

Algorithm summarizes the above steps. 


4. The same argument applies when Tk = 1, i.e. G is the first point outside the domain among {Ok}- 
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